{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create Yelp Reviews data for Sentiment Analysis and Word Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T21:45:25.651098Z",
     "start_time": "2020-06-20T21:45:25.648545Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T21:45:25.883936Z",
     "start_time": "2020-06-20T21:45:25.652790Z"
    }
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import pandas as pd\n",
    "from pandas.io.json import json_normalize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## About the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data consists of several files with information on the business, the user, the review and other aspects that Yelp provides to encourage data science innovation.\n",
    "\n",
    "The data consists of several files with information on the business, the user, the review and other aspects that Yelp provides to encourage data science innovation. \n",
    "\n",
    "We will use around six million reviews produced over the 2010-2019 period to extract text features. In addition, we will use other information submitted with the review about the user. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can download the data from [here](https://www.yelp.com/dataset) in json format after accepting the license. The 2020 version has 4.7GB (compressed) and around 10.5GB (uncompressed) of text data.\n",
    "\n",
    "After download, extract the following two of the five `.json` files into to `./yelp/json`:\n",
    "- the `yelp_academic_dataset_user.json`\n",
    "- the `yelp_academic_dataset_reviews.json`\n",
    "\n",
    "Rename both files by stripping out the `yelp_academic_dataset_` prefix so you have the following directory structure:\n",
    "```\n",
    "data\n",
    "|-create_yelp_review_data.ipynb\n",
    "|-yelp\n",
    "    |-json\n",
    "        |-user.json\n",
    "        |-reviews.json\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T21:45:25.887070Z",
     "start_time": "2020-06-20T21:45:25.885011Z"
    }
   },
   "outputs": [],
   "source": [
    "yelp_dir = Path('yelp')\n",
    "\n",
    "if not yelp_dir.exists():\n",
    "    parquet_dir.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parse json and store as parquet files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert json to faster parquet format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T21:49:46.038635Z",
     "start_time": "2020-06-20T21:45:25.888086Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "review\n",
      "user\n"
     ]
    }
   ],
   "source": [
    "for fname in ['review', 'user']:\n",
    "    print(fname)\n",
    "    \n",
    "    json_file = yelp_dir / 'json' / f'{fname}.json'\n",
    "    parquet_file = yelp_dir / f'{fname}.parquet'\n",
    "    if parquet_file.exists():\n",
    "        print('\\talready exists')\n",
    "        continue\n",
    "\n",
    "    data = json_file.read_text(encoding='utf-8')\n",
    "    json_data = '[' + ','.join([l.strip()\n",
    "                                for l in data.split('\\n') if l.strip()]) + ']\\n'\n",
    "    data = json.loads(json_data)\n",
    "    df = json_normalize(data)\n",
    "    if fname == 'review':\n",
    "        df.date = pd.to_datetime(df.date)\n",
    "        latest = df.date.max()\n",
    "        df['year'] = df.date.dt.year\n",
    "        df['month'] = df.date.dt.month\n",
    "        df = df.drop(['date', 'business_id', 'review_id'], axis=1)\n",
    "    if fname == 'user':\n",
    "        df.yelping_since = pd.to_datetime(df.yelping_since)\n",
    "        df = (df.assign(member_yrs=lambda x: (latest - x.yelping_since)\n",
    "                        .dt.days.div(365).astype(int))\n",
    "              .drop(['elite', 'friends', 'name', 'yelping_since'], axis=1))\n",
    "    df.dropna(how='all', axis=1).to_parquet(parquet_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can remove the json files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T21:50:59.700866Z",
     "start_time": "2020-06-20T21:50:59.696410Z"
    }
   },
   "outputs": [],
   "source": [
    "def merge_files(remove=False):\n",
    "    combined_file = yelp_dir / 'user_reviews.parquet'\n",
    "    if not combined_file.exists():\n",
    "        user = pd.read_parquet(yelp_dir / 'user.parquet')\n",
    "        print(user.info(null_counts=True))\n",
    "\n",
    "        review = pd.read_parquet(yelp_dir / 'review.parquet')\n",
    "        print(review.info(null_counts=True))\n",
    "\n",
    "        combined = (review.merge(user, on='user_id',\n",
    "                                 how='left', suffixes=['', '_user'])\n",
    "                    .drop('user_id', axis=1))\n",
    "        combined = combined[combined.stars > 0]\n",
    "        print(combined.info(null_counts=True))\n",
    "        combined.to_parquet(yelp_dir / 'user_reviews.parquet')\n",
    "    else:\n",
    "        print('already merged')\n",
    "    if remove:\n",
    "        for fname in ['user', 'reviews']:\n",
    "            f = yelp_dir / (fname + '.parquet')\n",
    "            if f.exists():\n",
    "                f.unlink()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T21:51:47.956186Z",
     "start_time": "2020-06-20T21:51:00.042343Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 1968703 entries, 0 to 1968702\n",
      "Data columns (total 19 columns):\n",
      " #   Column              Non-Null Count    Dtype  \n",
      "---  ------              --------------    -----  \n",
      " 0   user_id             1968703 non-null  object \n",
      " 1   review_count        1968703 non-null  int64  \n",
      " 2   useful              1968703 non-null  int64  \n",
      " 3   funny               1968703 non-null  int64  \n",
      " 4   cool                1968703 non-null  int64  \n",
      " 5   fans                1968703 non-null  int64  \n",
      " 6   average_stars       1968703 non-null  float64\n",
      " 7   compliment_hot      1968703 non-null  int64  \n",
      " 8   compliment_more     1968703 non-null  int64  \n",
      " 9   compliment_profile  1968703 non-null  int64  \n",
      " 10  compliment_cute     1968703 non-null  int64  \n",
      " 11  compliment_list     1968703 non-null  int64  \n",
      " 12  compliment_note     1968703 non-null  int64  \n",
      " 13  compliment_plain    1968703 non-null  int64  \n",
      " 14  compliment_cool     1968703 non-null  int64  \n",
      " 15  compliment_funny    1968703 non-null  int64  \n",
      " 16  compliment_writer   1968703 non-null  int64  \n",
      " 17  compliment_photos   1968703 non-null  int64  \n",
      " 18  member_yrs          1968703 non-null  int64  \n",
      "dtypes: float64(1), int64(17), object(1)\n",
      "memory usage: 285.4+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 8021122 entries, 0 to 8021121\n",
      "Data columns (total 8 columns):\n",
      " #   Column   Non-Null Count    Dtype  \n",
      "---  ------   --------------    -----  \n",
      " 0   user_id  8021122 non-null  object \n",
      " 1   stars    8021122 non-null  float64\n",
      " 2   useful   8021122 non-null  int64  \n",
      " 3   funny    8021122 non-null  int64  \n",
      " 4   cool     8021122 non-null  int64  \n",
      " 5   text     8021122 non-null  object \n",
      " 6   year     8021122 non-null  int64  \n",
      " 7   month    8021122 non-null  int64  \n",
      "dtypes: float64(1), int64(5), object(2)\n",
      "memory usage: 489.6+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 8021122 entries, 0 to 8021121\n",
      "Data columns (total 25 columns):\n",
      " #   Column              Non-Null Count    Dtype  \n",
      "---  ------              --------------    -----  \n",
      " 0   stars               8021122 non-null  float64\n",
      " 1   useful              8021122 non-null  int64  \n",
      " 2   funny               8021122 non-null  int64  \n",
      " 3   cool                8021122 non-null  int64  \n",
      " 4   text                8021122 non-null  object \n",
      " 5   year                8021122 non-null  int64  \n",
      " 6   month               8021122 non-null  int64  \n",
      " 7   review_count        8021122 non-null  int64  \n",
      " 8   useful_user         8021122 non-null  int64  \n",
      " 9   funny_user          8021122 non-null  int64  \n",
      " 10  cool_user           8021122 non-null  int64  \n",
      " 11  fans                8021122 non-null  int64  \n",
      " 12  average_stars       8021122 non-null  float64\n",
      " 13  compliment_hot      8021122 non-null  int64  \n",
      " 14  compliment_more     8021122 non-null  int64  \n",
      " 15  compliment_profile  8021122 non-null  int64  \n",
      " 16  compliment_cute     8021122 non-null  int64  \n",
      " 17  compliment_list     8021122 non-null  int64  \n",
      " 18  compliment_note     8021122 non-null  int64  \n",
      " 19  compliment_plain    8021122 non-null  int64  \n",
      " 20  compliment_cool     8021122 non-null  int64  \n",
      " 21  compliment_funny    8021122 non-null  int64  \n",
      " 22  compliment_writer   8021122 non-null  int64  \n",
      " 23  compliment_photos   8021122 non-null  int64  \n",
      " 24  member_yrs          8021122 non-null  int64  \n",
      "dtypes: float64(2), int64(22), object(1)\n",
      "memory usage: 1.6+ GB\n",
      "None\n"
     ]
    },
    {
     "ename": "TypeError",
     "evalue": "unsupported operand type(s) for +: 'PosixPath' and 'str'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-9-892254c45114>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmerge_files\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mremove\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m<ipython-input-8-159c5a2caa96>\u001b[0m in \u001b[0;36mmerge_files\u001b[0;34m(remove)\u001b[0m\n\u001b[1;32m     18\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mremove\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     19\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mfname\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'user'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'reviews'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 20\u001b[0;31m             \u001b[0mf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0myelp_dir\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0mfname\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m'.parquet'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     21\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     22\u001b[0m                 \u001b[0mf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munlink\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'PosixPath' and 'str'"
     ]
    }
   ],
   "source": [
    "merge_files(remove=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:ml4t] *",
   "language": "python",
   "name": "conda-env-ml4t-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
